Matt Massie, UC Berkeley Computer Sciences
Machine learning (ML) is data-driven: its algorithms learn from data and make predictions on it rather than following strictly static instructions.
Supervised (e.g. classification) vs Unsupervised (e.g. anomaly detection) learning
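To make the distinction concrete, here is a minimal sketch in plain Python. The data values, the nearest-centroid classifier, and the two-standard-deviation outlier rule are all illustrative assumptions chosen for brevity; they are not the methods used later in this talk.

```python
# Supervised: labels are available, so we learn a rule mapping features to labels.
# Toy 1-D data (made up for illustration): (feature value, label)
toy_labeled = [(1.0, 0), (2.0, 0), (8.0, 1), (9.0, 1)]

def fit_centroids(samples):
    """Return the mean feature value for each label."""
    sums, counts = {}, {}
    for x, y in samples:
        sums[y] = sums.get(y, 0.0) + x
        counts[y] = counts.get(y, 0) + 1
    return {y: sums[y] / counts[y] for y in sums}

def classify(x, centroids):
    """Predict the label whose centroid is closest to x."""
    return min(centroids, key=lambda y: abs(x - centroids[y]))

toy_centroids = fit_centroids(toy_labeled)
toy_prediction = classify(7.0, toy_centroids)

# Unsupervised: no labels; flag values that sit far from the bulk of the data.
toy_unlabeled = [1.0, 1.2, 0.9, 1.1, 1.0, 6.0]

def find_anomalies(values, threshold=2.0):
    """Return values more than `threshold` standard deviations from the mean."""
    mean = sum(values) / len(values)
    std = (sum((v - mean) ** 2 for v in values) / len(values)) ** 0.5
    return [v for v in values if abs(v - mean) > threshold * std]

toy_outliers = find_anomalies(toy_unlabeled)
```

The classifier needs the labels to learn anything, while the anomaly detector works from the feature values alone.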
In this short talk, we'll explore the freely available Breast Cancer Wisconsin Data Set from the University of California, Irvine (UCI) Machine Learning Repository.
Data set creators:
In [1]:
import numpy as np
import pandas as pd
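def load_data(filename):
    """Load a comma-separated data file into a DataFrame of integers."""
    import csv
    # Text mode ('r') works with the csv module on both Python 2 and 3.
    with open(filename, 'r') as csvfile:
        csvreader = csv.reader(csvfile, delimiter=',')
        # Missing values are marked with '?'; encode them as -1.
        df = pd.DataFrame([[-1 if el == '?' else int(el) for el in r] for r in csvreader])
    df.columns = ["patient_id", "radius", "texture", "perimeter", "smoothness", "compactness", "concavity", "concave_points", "symmetry", "fractal_dimension", "malignant"]
    # Recode the class column: 2 -> 0 (benign), 4 -> 1 (malignant).
    df['malignant'] = df['malignant'].map({2: 0, 4: 1})
    return df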
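The parsing step can be tried without the data files. The sketch below applies the same '?' handling and class recoding to two made-up rows held in a string; the row values and the helper name `parse_rows` are assumptions for illustration only.

```python
import csv
import io

# Two made-up rows in the same layout as the data file: id, nine features, class.
toy_csv = "1001,5,1,1,1,2,1,3,1,1,2\n1002,8,4,5,1,2,?,7,3,1,4\n"

def parse_rows(text):
    """Parse CSV text, mapping the '?' missing-value marker to -1."""
    reader = csv.reader(io.StringIO(text), delimiter=',')
    return [[-1 if el == '?' else int(el) for el in row] for row in reader]

toy_rows = parse_rows(toy_csv)

# Recode the class column (2 -> 0, 4 -> 1), as load_data does.
toy_labels = [{2: 0, 4: 1}[row[-1]] for row in toy_rows]
```

Note that the -1 sentinel is treated as an ordinary numeric value by every later step, including scaling. That keeps the loader simple, at the cost of feeding an artificial value to the model wherever a measurement was missing.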
In [2]:
training_set = load_data("data/breast-cancer.train")
test_set = load_data("data/breast-cancer.test")
print("Training set has %d patients" % training_set.shape[0])
print("Test set has %d patients\n" % test_set.shape[0])
print(training_set.iloc[:, 0:6].head(3))
print("")
print(training_set.iloc[:, 6:11].head(3))
In [3]:
training_set_malignant = training_set['malignant']
training_set_features = training_set.iloc[:, 1:10]
test_set_malignant = test_set['malignant']
test_set_features = test_set.iloc[:, 1:10]
This image shows how a support vector machine searches for a "Maximum-Margin Hyperplane" in 2-dimensional space.
The breast cancer data set is 9-dimensional.
Image by User:ZackWeinberg, based on PNG version by User:Cyc [CC BY-SA 3.0], via Wikimedia Commons
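The idea of a margin can be sketched numerically. The code below measures the margin of two candidate separating lines over four made-up 2-dimensional points; the points, the candidate lines, and the helper names are assumptions for illustration. It only evaluates given hyperplanes and does not search for the best one as an SVM solver would.

```python
import math

def signed_distance(w, b, x):
    """Signed distance from point x to the hyperplane w.x + b = 0."""
    dot = sum(wi * xi for wi, xi in zip(w, x))
    norm = math.sqrt(sum(wi * wi for wi in w))
    return (dot + b) / norm

def margin(w, b, points, labels):
    """Smallest distance from any point to the hyperplane.

    Labels are +1 or -1; the result is negative if any point
    lies on the wrong side.
    """
    return min(y * signed_distance(w, b, x) for x, y in zip(points, labels))

# Made-up 2-D points: two per class.
toy_points = [(1.0, 1.0), (2.0, 1.0), (4.0, 4.0), (5.0, 4.0)]
toy_classes = [-1, -1, 1, 1]

# Two candidate lines; both classify every point correctly.
margin_a = margin((1.0, 0.0), -3.0, toy_points, toy_classes)  # vertical line x = 3
margin_b = margin((1.0, 1.0), -5.5, toy_points, toy_classes)  # diagonal line x + y = 5.5
```

Both lines separate the classes, but the diagonal one leaves more room around the closest points, so a maximum-margin method would prefer it. Because the helpers iterate over vector components, the same functions work unchanged for 9-dimensional points.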
In [4]:
from sklearn.preprocessing import MinMaxScaler
from sklearn import svm
# (1) Scale the 'training set'
scaler = MinMaxScaler()
scaled_training_set_features = scaler.fit_transform(training_set_features)
# (2) Create the model
model = svm.LinearSVC(C=0.1)
# (3) Fit the model using the 'training set'
model.fit(scaled_training_set_features, training_set_malignant)
# (4) Scale the 'test set' using the same scaler as the 'training set'
scaled_test_set_features = scaler.transform(test_set_features)
# (5) Use the model to predict malignancy for the 'test set'
test_set_malignant_predictions = model.predict(scaled_test_set_features)
print(test_set_malignant_predictions)
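Step (4) reuses the scaler fitted in step (1) instead of fitting a new one on the test set. The sketch below shows why that matters, using a hypothetical `SimpleMinMaxScaler` class written in plain Python. It mimics the basic fit/transform idea only and is not scikit-learn's implementation; the toy numbers are made up.

```python
class SimpleMinMaxScaler(object):
    """Per-column min-max scaling to [0, 1], learned from the rows passed to fit()."""

    def fit(self, rows):
        columns = list(zip(*rows))
        self.mins = [min(col) for col in columns]
        self.maxs = [max(col) for col in columns]
        return self

    def transform(self, rows):
        scaled = []
        for row in rows:
            scaled.append([
                # Guard against constant columns, where max == min.
                (v - lo) / float(hi - lo) if hi != lo else 0.0
                for v, lo, hi in zip(row, self.mins, self.maxs)
            ])
        return scaled

toy_train = [[1, 10], [5, 20], [9, 30]]
toy_test = [[3, 40]]

# Learn the column ranges from the training rows only.
toy_scaler = SimpleMinMaxScaler().fit(toy_train)
toy_scaled_train = toy_scaler.transform(toy_train)

# Apply the same ranges to the test rows.
toy_scaled_test = toy_scaler.transform(toy_test)
```

The second test value scales to 1.5 because 40 lies above the training maximum of 30. That is expected: the model was fitted on features measured against the training ranges, so test features must be expressed in those same units, even when they fall outside [0, 1].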
In [5]:
from sklearn import metrics
accuracy = metrics.accuracy_score(test_set_malignant,
                                  test_set_malignant_predictions) * 100
# Rows are actual classes and columns are predicted classes (0 = benign, 1 = malignant).
((tn, fp), (fn, tp)) = metrics.confusion_matrix(test_set_malignant,
                                                test_set_malignant_predictions)
print("Accuracy: %.2f%%" % accuracy)
print("True Positives: %d, True Negatives: %d" % (tp, tn))
print("False Positives: %d, False Negatives: %d" % (fp, fn))
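The four counts above can be reproduced by hand. The following sketch tallies them for a made-up pair of label lists and derives accuracy from them; the lists and the helper name `confusion_counts` are illustrative assumptions, and precision and recall are included only to show what else the same counts provide.

```python
def confusion_counts(actual, predicted):
    """Return (tn, fp, fn, tp) for binary labels where 1 is the positive class."""
    tn = fp = fn = tp = 0
    for a, p in zip(actual, predicted):
        if a == 1 and p == 1:
            tp += 1
        elif a == 1 and p == 0:
            fn += 1
        elif a == 0 and p == 1:
            fp += 1
        else:
            tn += 1
    return tn, fp, fn, tp

# Made-up labels: 1 = malignant, 0 = benign.
toy_actual = [0, 0, 0, 0, 1, 1, 1, 1]
toy_predicted = [0, 0, 0, 1, 1, 1, 0, 0]

toy_tn, toy_fp, toy_fn, toy_tp = confusion_counts(toy_actual, toy_predicted)

# Accuracy: share of all predictions that were correct.
toy_accuracy = (toy_tp + toy_tn) / float(len(toy_actual))
# Recall: share of malignant cases that were caught.
toy_recall = toy_tp / float(toy_tp + toy_fn)
# Precision: share of malignant predictions that were right.
toy_precision = toy_tp / float(toy_tp + toy_fp)
```

Accuracy alone weighs both kinds of error equally. In a screening setting a false negative (a missed malignancy) is usually the costlier mistake, which is why it helps to look at the individual counts as well.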